Fast and accurate genome-wide predictions and structural modeling of protein–protein interactions using Galaxy

Background Protein–protein interactions play a crucial role in almost all cellular processes. Identifying interacting proteins reveals insight into living organisms and yields novel drug targets for disease treatment. Here, we present a publicly available, automated pipeline to predict genome-wide protein–protein interactions and produce high-quality multimeric structural models. Results Application of our method to the Human and Yeast genomes yield protein–protein interaction networks similar in quality to common experimental methods. We identified and modeled Human proteins likely to interact with the papain-like protease of SARS-CoV2’s non-structural protein 3. We also produced models of SARS-CoV2’s spike protein (S) interacting with myelin-oligodendrocyte glycoprotein receptor and dipeptidyl peptidase-4. Conclusions The presented method is capable of confidently identifying interactions while providing high-quality multimeric structural models for experimental validation. The interactome modeling pipeline is available at usegalaxy.org and usegalaxy.eu.


Background
Obtaining a complete map of interacting proteins is crucial to decipher the inner workings of living organisms. Among many other roles, proteins act in dynamic collaboration to fulfill biological functions by catalyzing chemical processes. Commonly, interactions are elucidated through a variety of experimental methods [33,47] which are capable of evaluating an ever-larger number of putative protein pairs. Unfortunately, the overlap between these methods is often limited which either indicates a high false positive rate or a low coverage. Often 40-90% of the detected interactions do not overlap between different methods [30,39]. Also, high throughput methods do not provide structural insights into the formed protein-protein complex. More reliable methods such as crystallography and NMR spectroscopy do yield structural information but are labor intensive and as such only applicable to a limited number of proteins. In a recent study we demonstrated that the gap between low and high throughput methods can be bridged by identifying distantly related protein-protein homologues with similar protein-protein interfaces [10]. Application of the SPRING method [11] to Escherichia coli competitively identified protein-protein interactions while producing accurate multimeric protein structure models of which 39 by now have been confirmed in high-resolution experiments. Other studies applied our method to the minimal synthetic genome syn3.0 [44] and the mouse genome [20]. In the present study we describe how we implemented our pipeline on Galaxy [1], a web-based computational workbench used by many scientists across the world to analyze large data sets. This allows scientists to reproduce, share and embed the resulting interactome networks within their own analysis pipelines. Given a set of query sequences and a list of known protein structures, the pipeline employs SPRING with HHsearch [35], and TMalign [46] to detect and structurally model protein-protein interactions. We validate the pipeline's performance by comparing the resulting Human and Yeast protein networks with experimental findings. Similar to the results for Escherichia coli, the method competitively resolves Human and Yeast protein-protein interaction networks. As novel targets, we identified Human proteins likely to bind the papain-like protease of SARS-CoV2's non-structural protein 3 (Nsp3). We also obtained models for SARS-CoV2's spike protein (S) in complex formation with myelin-oligodendrocyte glycoprotein (MOG) and dipeptidyl peptidase-4 (DPP4). Some of the detected interactions have already been experimentally confirmed in recent literature, others provide novel insights into the pathology of SARS-CoV2. Notably, the interaction with DPP4 has been suggested to cause a higher mortality rate of diabetics contracting SARS-CoV2 [37] while the MOG receptor is associated with the MOG antibody disease which relapses in SARS-CoV2 patients [41].

Performance validation with human and yeast interactomes
The pipeline's performance is validated on 20,610 raw protein coding gene sequences from the Human Reference Genome (UP000005640) of the UniProt database. This process evaluates ~ 212 million possible pairs to identify the set of interacting proteinprotein pairs. Each interaction is ranked by the Z com score and Matthew's correlation coefficient (MCC) is determined with regard to a negative data set of non-interacting protein pairs produced by the SPRING MCC tool and positive data sets derived from each experimental method. The negative data set has been sampled to contain proteins from different subcellular regions. Figure 1 displays the cross-validation performance results in comparison to ten experimental methods available in the BioGRID database. Note that we applied SPRING on the raw protein coding sequences without separating the individual proteins using the CDS record provided by GenBank [4]. In total the 20,610 protein coding genes encode for about 75,776 individual proteins.
We repeated the same experiment using the Yeast genome (UP000002311) to identify protein-protein interaction networks. In total 6045 protein coding genes were parsed through the pipeline evaluating ~ 18 million possible protein-protein interactions using the public Galaxy instance at https:// usega laxy. org. The results are shown in the right panel of Fig. 1.
A more detailed analysis regarding the prediction performance of the presented pipeline versus experimental methods has been recently published for the Escherichia coli genome [10]. The pipeline predicted several protein complex structures which were later experimentally verified by crystallography.
For all three genomes, our pipeline was able to implicitly identify individual protein sequences and achieve an overall performance which is comparable if not better than existing experimental methods.

SARS-CoV2 protease (Nsp3) and ubiquitin
We next applied the Galaxy pipeline to the genome of SARS-CoV2 which causes a novel severe acute respiratory syndrome and has been declared a pandemic [27]. The SARS-CoV2 genome contains 13-15 open reading frames with ~ 30 thousand nucleotides, including 11 protein-coding genes. Our pipeline identified Human substrates for the papain-like protease of SARS-CoV2 which is part of the non-structural protein 3 (Nsp3) (see Fig. 2). Table 1 shows a list of the highest ranking fifteen substrates with matching multimeric templates and model quality attributes i.e. SM-score, TM-score, E contact and Z com . The two highest ranking interactions were identified for ISG15 (SM-score = 1.11) and ANKUB (SM-score = 1.10). ISG15 has recently been experimentally confirmed as a substrate [32] and ANKUB was suggested in a computational cleavage enrichment study [29].

SARS-CoV2 spike protein (S) and myelin-oligodendrocyte glycoprotein
We also identified Human proteins interacting with SARS-CoV2's spike protein (S). The top-ranking interaction was found for angiotensin (ACE2/ACE), which is widely known to be the primary receptor for SARS-CoV2 [48].
The second highest ranking model was detected for the interaction with the myelin-oligodendrocyte glycoprotein (MOG, see Fig. 3). MOG is a protein located on the surface of myelin sheaths in the central nervous system [14]. Our pipeline modeled the monomeric structure of MOG with the highest ranking homologue in the PDB70 database which is PDB entry 4PFE [9] at a Z-score of 102.2. We compared the resulting monomeric model with the model provided by Mesleh et al. [26]. Both models resolve MOG as a beta-barrel and share significant similarity at a TM-score of 0.70. Additionally several suitable multimeric template frameworks were identified. The corresponding PDB entries are 7C8V [21], 6XC2 [43], 6XC4 [43], 7BZ5 [42] and 7C01 [32]. All of these structures, except 7C8V, were crystalized with a potent neutralizing antibody of SARS-CoV2. Table 2 shows the identified template frameworks and the resulting model scores.
The results indicate two distinct putative binding modes which may occur in tandem (see Fig. 3C).
The MOG receptor is associated with MOG antibody disease (MOGAD), a neuroinflammatory condition that may cause inflammation of the optic nerve, the spinal  cord and brain. Recent research has shown that SARS-CoV2 does trigger a relapse of MOGAD [41].

SARS-CoV2 Spike protein (S) and dipeptidyl peptidase-4
High-scoring models were also generated for dipeptidyl peptidase-4 (DPP4, see Fig. 4), confirming the computational modeling results presented by [19,22]. DPP4 is a cell surface glycoprotein receptor involved in T-cell activation and assumed to play a role in cell adhesion, migration and tube formation [8].  Additionally, inhibiting DPP4 prevents glucagon release while increasing insulin secretion to decrease blood glucose levels [24]. DPP4 is known to interact with MERS-CoV [40]. The highest scoring template frameworks for DPP4 were PDB entry 4KR0 [23] with a Z com score of 216.30 and PDB entry 4L72 [40] with 213.4. The resulting dimeric models are very similar to each other. The multimeric template matched the individual models with a TM-score of 0.62, a mean contact energy of -6.7 and SM-score of 0.69.
Several sites are known to significantly contribute to the interaction between DPP4 and MERS-CoV's receptor-binding domain (RBD). These are DPP4 residues K267, R336, R317, and Q344 ( [22,34], see Fig. 4) along with polymorphic sites as outlined in Table 3. Our method illustrates that SARS-Cov2's S protein interacts with sites on DPP4 shared by MERS-CoV in addition to novel interaction sites (see Table 3).  Additionally, half of fourteen critical binding sites [18] have been identified as polymorphic in Humans. Taken together, binding propensities between SARS-Cov2's S protein and DPP4 might vary based on the population.

Discussion
Accurate identification of protein-protein interactions is essential to decipher cellular processes and detect novel drug targets. In the present work we implemented a Galaxy pipeline using the SPRING method which detects and structurally models protein-protein interactions by identifying distantly related protein complex structures with similar protein-protein interfaces. The presented pipeline yields insights into the biochemical activity of SARS-CoV2 by identifying distant homologues with similar binding interfaces to Human proteins. For the papain-like protease of the non-structural protein 3 (Nsp3), we detected several ubiquitin-like substrates of which some have been experimentally confirmed. The method produced a top-ranking model for SARS-CoV's spike protein (S) and dipeptidyl peptidase-4 (DPP4) in alignment with existing literature. Our method produced novel complex models between the S protein and myelin-oligodendrocyte glycoprotein (MOG). Here two top-ranking binding modes were produced. Experimental exploration will be needed to determine what impact these novel binding sites might play in pathogenicity, immune evasion, and adaptation. The prediction confidence relies on the accuracy of the homology match between templates, the structural fit and a knowledgebased contact potential, providing likely binding modes and interaction partners for further investigation. Only additional experimental validation can determine which or if any of the predicted binding modes occur in nature.
A limitation of our method is that it may produce high-confidence models between proteins which are localized in different subcellular regions. Existing literature has shown that such cross-interactions occur in a significant number of cases. In the present work we avoid filtering predicted protein interaction pairs by their corresponding subcellular locations since this would bias the obtained Matthew's correlation coefficients. Identifying an accurate set of truly non-interacting protein pairs is critical and particularly challenging for the evaluation of protein-protein interactions. Randomly sampling protein pairs across a genome may lead to the inclusion of interacting protein pairs. A more accurate method is to sample non-interacting sets by pairing proteins from different subcellular regions as presented here. Yet another common suggestion is to exclude homologue protein pairs from the non-interacting set all together in order to avoid the inclusion of interacting pairs. This however is not an option due to the nature of the presented method which relies on homology detection to predict protein-protein interactions.
Another limitation is that homology modeling does rely on experimental templates. All of the fifteen most confident models derived for Human proteins interacting with Nsp3's protease rely on four crystallographic complex structures.

Conclusions
This pipeline demonstrates the ability to detect interactome networks for a range of organisms. The increasing number of resolved co-crystal structures in the PDB, will continually improve the model quality and coverage over time [6]. Since the pipeline includes all data preparation steps no manual adjustment is required once new data has been published to the PDB. Galaxy enables users to employ the pipeline within their own methodologies and add or modify steps as required using Galaxy's web-based workflow editor. Users are now able to reproduce and share the resulting interactome networks. The present contribution expands the repertoire of Galaxy tools to structural modeling methodologies, making them available for a large number of users. Recent advances in protein structure prediction and modeling [3,13,16] complement existing sequence analysis tools and provide novel targets for drug discovery and elucidating biochemical processes through structural insights.

Protein-protein interaction analysis pipeline
We present a Galaxy pipeline to predict and structurally model protein-protein interactions on a genomic scale. The pipeline takes the following inputs: (1) An individual file or a pair of files containing multiple FASTA entries of protein coding sequences. The pipeline will attempt to identify protein-protein interactions within the set of query sequences. (2) Text file containing the list of all Protein Data Bank (PDB) [5] entry identifiers to be employed as a multimeric template library. This step can be skipped if the library has already been constructed. (3) PDB70 threading library files as provided by the developers of HHsearch. These files are used to perform single-chain threading and can be obtained from http:// wwwus er. gwdg. de/ ~compb iol/ data/ hhsui te/ datab ases/ hhsui te_ dbs/.
The following outputs are generated: (1) Tabular file containing all identified interactions with their corresponding templates and Z com scores. (2) Tabular file containing the details of the produced multimeric structural models and the corresponding model properties, i.e. SM-score, TM-score, a knowledgebased contact energy term E contact and the fraction of inter chain clashes. This workflow is available at: https:// usega laxy. org/u/ guerl er/w/ inter action-predi ction.

Data preparation
The presented pipeline utilizes all protein-protein interfaces available in the PDB as a template library for interface homology detection (Fig. 5). The data preparation starts by using the DBkit tool in Galaxy to download all PDB entries and store them as a ffindex/ ffdata database pair. As of November 29th, 2020 this amounted to 170,860 files. Then the SPRING Cross tool is applied which scans each PDB entry for protein-protein interfaces and stores the corresponding interacting PDB chain identifiers in a 2-column lookup table as a pairwise index of all interactions. In more detail, the SPRING Cross method proceeds by using the PDB REMARK 350 entries to build all bio units available in a given PDB entry. Then all C-alpha atom distances between two separate chains within the same bio unit are determined. If more than five distances below 10Å are detected for a pair of PDB chains, the corresponding PDB chain identifiers are deemed as interacting and added as a new row to the resulting 2-column lookup table. This yields a complete set of 988,784 interacting PDB chain identifier pairs contained in the PDB which we will use as a multimeric template library.
In a consecutive step the SPRING Map tool is applied which uses PSI-BLAST [2] to detect close homologues of the PDB70 database for each PDB chain identifier listed in the columns of the lookup table. The identifiers of matching PDB70 entries are added in two additional columns to the lookup table. This allows us to apply HHsearch on a nonredundant subset of the PDB, containing entries with less than 70% sequence identity to each other. Although possible, expanding the monomeric threading database by including every PDB chain would significantly impact the database preparation time without improving the overall prediction performance. We used the PDB70 database issued on November 18, 2020 containing 58,900 entries. If a PSI-BLAST E-value equal to zero is used, 257,698 interaction frameworks which exactly match the sequences in the PDB70 are detected. With an E-value threshold of 0.001, the resulting 4-column lookup table contained 900,772 interaction frameworks suitable for the monomers available in the PDB70 database.

Interaction prediction
The pipeline's interaction prediction logic uses SPRING with HHsearch and TMalign, and was designed to exploit the redundancy of available protein-protein interfaces in order to predict and model novel protein tertiary structures.
Initially each query sequence Q is threaded by HHsearch against the PDB70 monomeric template library to identify a set of putative templates (T i , i = 1,2,…) each associated with a Z-score (Z i ). The Z-score is defined as the number of standard deviations by which the raw alignment score differs from its mean. A higher Z-score indicates a higher significance and usually corresponds to a better alignment.
Considering all possible target sequence pairs, the SPRING Min-Z tool uses the previously described 4-column lookup table to select interaction frameworks which are shared by the monomeric templates of a query pair. The Z-score of the framework is defined as Z com which is the smaller of the two monomeric Z-scores. A more detailed description of this algorithm is provided in [10,11].

Interaction validation
The accuracy of predicted protein-protein interactions is evaluated using the SPRING MCC tool. This tool compares the set of interactions from SPRING with interactions obtained from experimental methods contained in the Biological General Repository for Interaction Data sets (BioGRID) [28]. BioGRID is an open access database that contains protein interactions curated from primary biomedical literature for all major model organism species and Humans. The SPRING MCC tool accesses the BioGrid Tab 3.0 format columns 24 and 27, containing the UniProt [36] accession identifiers of interacting protein pairs. The method only operates on interactions identified for sequences which are available in the UniProt database.
Initially, the SPRING MCC tool produces a `negative` data set of non-interacting protein pairs by randomly sampling protein-protein interaction pairs from the set of query protein sequences. If a UniProt localization file is provided, the non-interacting pairs can be determined by sampling protein sequences from different subcellular regions. This approach can reduce the false-negative rate of the resulting negative data set.
Subsequently, the protein-protein interaction sets identified by each experimental method are considered to be truly interacting, constituting the `positive` data sets for the cross-validation process. Each method is compared to all other methods using the positive data sets and a negative data set of equal size. The resulting Matthew's correlation coefficients (MCC) are plotted using the Matplotlib library [12]. An example of such a plot is shown in Fig. 1, displaying the results for the Human and Yeast interactome validation. The legend lists the experimental methods used to determine the corresponding positive data set.

Structural modeling
If a pair of proteins, Chain A and Chain B, is deemed to potentially interact e.g. Z com > 25, the complex structure is constructed by structurally aligning the top-ranked monomer templates of Chain A and Chain B to all putative interacting frameworks using the SPRING Model tool which utilizes TM-align. The structural alignment is built on the subset of interface residues. The resulting models are evaluated by the recently established SPRING model score [38]: where TM-score is the smaller TM-score returned by TM-align when aligning the topranked monomer models of Chain A and Chain B to the interaction framework; E contact is a residue-specific, atomic contact potential derived from 3897 non-redundant structure interfaces from the PDB using the formula of RW [45]. The weight parameter w 0 is set to 0.01 through a training set of protein complexes to maximize the modeling accuracy of the interface structures. The final model is evaluated for clashes and removed if more than 10% of the resulting C-alpha atom contacts share a distance of less than 5Å between the interacting pair of protein structures.